
HDDS-14913. Implement Scalable CSV Export for Unhealthy Containers in Recon UI.#10162

Draft
ArafatKhan2198 wants to merge 2 commits into apache:master from ArafatKhan2198:csvExport2

Conversation

ArafatKhan2198 (Contributor) commented Apr 30, 2026

What changes were proposed in this pull request?

The Recon UI had no way for administrators to export unhealthy container data (Missing, Under-Replicated, Over-Replicated, etc.) at scale. For clusters with millions of containers, any streaming export over a long-running HTTP connection would be killed by network infrastructure (firewalls, load balancers, proxies) before completion.


Solution: Asynchronous Background Export with Queue

Instead of streaming data directly to the browser, this PR implements a server-side background job system that:

  1. Builds the export on the Recon node itself
  2. Splits large exports into 500K-record CSV chunks
  3. Archives them into a single TAR file
  4. Lets the user download the TAR from the browser when ready
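
The chunk-splitting step above can be sketched with plain JDK I/O. This is a minimal illustration, not the PR's actual code: `writeChunked` is a hypothetical helper, the CSV header is made up, and the 500K threshold is passed in as a parameter so the example can use a small value.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CsvChunkWriter {

  /**
   * Writes records to part001.csv, part002.csv, ... under dir,
   * rolling to a new part file every maxPerFile records.
   * Returns the part files created.
   */
  static List<Path> writeChunked(Iterator<String> records, Path dir,
                                 long maxPerFile) throws IOException {
    List<Path> parts = new ArrayList<>();
    PrintWriter out = null;
    long inCurrentPart = 0;
    while (records.hasNext()) {
      if (out == null || inCurrentPart >= maxPerFile) {
        if (out != null) {
          out.close();
        }
        Path part = dir.resolve(String.format("part%03d.csv", parts.size() + 1));
        out = new PrintWriter(Files.newBufferedWriter(part));
        out.println("containerId,state,replicaDelta"); // hypothetical header
        parts.add(part);
        inCurrentPart = 0;
      }
      out.println(records.next());
      inCurrentPart++;
    }
    if (out != null) {
      out.close();
    }
    return parts;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("export");
    // 7 records with a chunk size of 3 -> 3 part files
    Iterator<String> records = java.util.stream.LongStream.range(0, 7)
        .mapToObj(i -> i + ",MISSING,0").iterator();
    System.out.println(writeChunked(records, dir, 3).size()); // 3
  }
}
```

In the real job the part files would then be fed to the TAR-archiving step before the temp directory is cleaned up.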

Backend Changes

New: ExportJob model (ExportJob.java)

A data class representing one export job with fields:

  • jobId (UUID), userId, state (container state), status (QUEUED → RUNNING → COMPLETED/FAILED)
  • queuePosition, totalRecords, estimatedTotal, progressPercent
  • filePath (path to TAR on disk), submittedAt, startedAt, completedAt, errorMessage
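
Put together, the model might look like the sketch below. The field names come from the list above; the constructor, enum shape, and `updateProgress` helper are illustrative, not the PR's exact code.

```java
import java.time.Instant;
import java.util.UUID;

/** Sketch of the ExportJob data class; shapes are illustrative. */
public class ExportJob {
  enum Status { QUEUED, RUNNING, COMPLETED, FAILED }

  final String jobId = UUID.randomUUID().toString();
  final String userId;
  final String state;            // container state being exported, e.g. "MISSING"
  volatile Status status = Status.QUEUED;

  volatile int queuePosition;
  volatile long totalRecords;    // records written so far
  volatile long estimatedTotal;  // from the up-front COUNT(*)
  volatile int progressPercent;

  volatile String filePath;      // TAR location once COMPLETED
  final Instant submittedAt = Instant.now();
  volatile Instant startedAt;
  volatile Instant completedAt;
  volatile String errorMessage;

  ExportJob(String userId, String state) {
    this.userId = userId;
    this.state = state;
  }

  /** Derives progressPercent from records written vs. the estimate. */
  void updateProgress(long written) {
    totalRecords = written;
    if (estimatedTotal > 0) {
      progressPercent = (int) Math.min(100, written * 100 / estimatedTotal);
    }
  }

  public static void main(String[] args) {
    ExportJob job = new ExportJob("webui", "MISSING");
    job.estimatedTotal = 1000;
    job.updateProgress(250);
    System.out.println(job.status + " " + job.progressPercent + "%"); // QUEUED 25%
  }
}
```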

New: ExportJobManager.java — the core engine

A Guice Singleton that runs for the lifetime of the Recon server:

  • Single-threaded executor — one export runs at a time, eliminating concurrent Derby database access
  • Global queue (max 4 jobs) — incoming requests beyond the limit return HTTP 429
  • 3-second cooldown between jobs (on the worker thread, transparent to users)
  • CSV splitting — every 500K records creates a new part file (e.g., part001.csv, part002.csv)
  • TAR archiving — all part files are archived using Archiver.create() into export_{state}_{userId}_{shortJobId}.tar
  • Progress tracking — runs a COUNT(*) before the cursor opens to calculate estimatedTotal; totalRecords increments live
  • Cleanup — temp CSV files and their directory are deleted after TAR is created
  • Synchronized submitJob() — prevents race conditions when multiple users submit simultaneously
  • getQueuePosition() — walks LinkedHashMap (insertion-order) to return 1-indexed position
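
The queueing pieces above (synchronized submission, a capped insertion-ordered map, a single worker) can be sketched as follows. The method names mirror the description, but the bodies are illustrative, assuming a queue cap of 4 and a no-op export step.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch of the queueing logic; not the PR's actual code. */
public class ExportJobManager {
  static final int MAX_QUEUED_JOBS = 4; // assumption for this sketch

  // Insertion-ordered so the queue position can be derived by walking entries.
  private final Map<String, String> queuedJobs = new LinkedHashMap<>();
  // Single worker thread: one export at a time (daemon so the JVM can exit).
  private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "export-worker");
    t.setDaemon(true);
    return t;
  });

  /** Returns a jobId, or null when the queue is full (caller maps null to HTTP 429). */
  synchronized String submitJob(String containerState) {
    if (queuedJobs.size() >= MAX_QUEUED_JOBS) {
      return null;
    }
    String jobId = UUID.randomUUID().toString();
    queuedJobs.put(jobId, containerState);
    worker.submit(() -> runExport(jobId));
    return jobId;
  }

  /** 1-indexed position in the insertion-ordered map; 0 if unknown. */
  synchronized int getQueuePosition(String jobId) {
    int pos = 1;
    for (String id : queuedJobs.keySet()) {
      if (id.equals(jobId)) {
        return pos;
      }
      pos++;
    }
    return 0;
  }

  private void runExport(String jobId) {
    // Real job: COUNT(*), stream cursor to CSV parts, tar, clean up,
    // then remove the entry from queuedJobs and apply the cooldown.
  }

  public static void main(String[] args) {
    ExportJobManager mgr = new ExportJobManager();
    String id = mgr.submitJob("MISSING");
    System.out.println("position: " + mgr.getQueuePosition(id)); // position: 1
  }
}
```

Because both `submitJob` and `getQueuePosition` synchronize on the manager, a burst of simultaneous submissions cannot overfill the queue or observe a half-updated position.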

ContainerEndpoint.java — new REST endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/containers/unhealthy/export | Submit a new export job |
| GET | /api/v1/containers/unhealthy/export | List all jobs (new) |
| GET | /api/v1/containers/unhealthy/export/{jobId} | Get one job's status |
| GET | /api/v1/containers/unhealthy/export/{jobId}/download | Stream the TAR to the browser |
| DELETE | /api/v1/containers/unhealthy/export/{jobId} | Cancel a job |

Queue-full (429) errors return JSON instead of Jetty's HTML error page.
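
A minimal sketch of such a JSON error body; the field names and message text here are illustrative, not the PR's exact schema.

```java
/** Builds the JSON entity returned alongside HTTP 429 when the export queue is full. */
public class QueueFullError {
  static String queueFullBody(int maxJobs) {
    return "{\"httpCode\":429,"
        + "\"error\":\"TOO_MANY_REQUESTS\","
        + "\"message\":\"Export queue is full (max " + maxJobs
        + " jobs). Please retry once a job completes.\"}";
  }

  public static void main(String[] args) {
    System.out.println(queueFullBody(4));
  }
}
```

Returning a structured body lets the frontend surface the specific message in a toast instead of parsing Jetty's HTML error page.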

ContainerHealthSchemaManager.java

  • Added getUnhealthyContainersCursor() — jOOQ lazy cursor for streaming DB records without holding them all in JVM heap
  • Added getUnhealthyContainersCount() — fast COUNT(*) used before the cursor opens for progress estimation
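
The count-then-stream pattern can be illustrated with a plain `Iterator` standing in for the jOOQ `Cursor`. This is a stdlib analogue of the control flow only; the real code fetches rows lazily from Derby rather than from an in-memory list.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.LongConsumer;

/** Stdlib analogue of count-then-stream; a jOOQ Cursor would replace the Iterator. */
public class StreamingExport {

  /** Streams rows one at a time, reporting percent complete after each row. */
  static long streamWithProgress(long estimatedTotal, Iterator<String> cursor,
                                 LongConsumer onPercent) {
    long written = 0;
    while (cursor.hasNext()) {
      cursor.next();          // real job: write the row to the current CSV part
      written++;
      if (estimatedTotal > 0) {
        onPercent.accept(Math.min(100, written * 100 / estimatedTotal));
      }
    }
    return written;
  }

  public static void main(String[] args) {
    List<String> rows = List.of("c1", "c2", "c3", "c4");
    long total = streamWithProgress(rows.size(), rows.iterator(),
        pct -> System.out.println(pct + "%"));
    System.out.println(total + " records"); // 4 records
  }
}
```

The up-front COUNT(*) only affects the `estimatedTotal` denominator, so a slightly stale count skews the progress bar but never the exported data.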

ReconServerConfigKeys.java

New config keys:

  • ozone.recon.export.worker.threads (default: 1)
  • ozone.recon.export.directory (default: /tmp/recon/exports)
  • ozone.recon.export.max.jobs.total (default: 10)

Frontend Changes (containers.tsx, container.types.ts)

New: Export Tab (tab key '6')

A dedicated Export tab is added to the Containers page alongside Missing, Under-Replicated, etc. It contains:

Submit Controls:

  • Dropdown to select container state (Missing, Under-Replicated, Over-Replicated, Mis-Replicated, Replica Mismatch)
  • "Export CSV" button — POSTs to backend and immediately shows the job in the table below

Active Exports table (hidden when empty):

  • Columns: Job ID (8-char + full ID tooltip), State, Status (colored Tag), Queue Position (#1, #2...), Progress bar + record count
  • No pagination — always compact

Completed Exports table (always visible, paginated):

  • Columns: Job ID, State, Status, Records, Submitted, Started, Completed, Action
  • Download button (only for COMPLETED jobs) — triggers TAR file download to browser
  • Error message tooltip (for FAILED jobs)
  • Timestamps formatted as MMM D, HH:mm:ss

Polling:

  • 3-second interval using setInterval + useRef — starts when Export tab is opened or a job is submitted
  • Auto-stops when no QUEUED or RUNNING jobs remain

Error handling:

  • 429 queue-full error shows a 6-second toast with the specific message
  • All errors show clean messages (no raw HTML from Jetty)
  • Guard in fetchTabData prevents undefined API calls when Export tab is active

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14913

How was this patch tested?

Recon log output from a manual export run (~3 million records):


2026-04-30 09:46:48,962 [pool-56-thread-1] INFO api.ExportJobManager: Starting export job ac16b513-f3f0-4e2d-a124-f208155697c3
2026-04-30 09:46:54,625 [pool-56-thread-1] INFO api.ExportJobManager: Export job ac16b513-f3f0-4e2d-a124-f208155697c3 will process approximately 3040000 records
2026-04-30 09:46:54,628 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part1
2026-04-30 09:47:28,413 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part2
2026-04-30 09:47:57,420 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part3
2026-04-30 09:47:58,876 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part4
2026-04-30 09:48:00,646 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part5
2026-04-30 09:48:02,488 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part6
2026-04-30 09:48:04,261 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part7
2026-04-30 09:48:04,429 [pool-56-thread-1] INFO api.ExportJobManager: Export job ac16b513-f3f0-4e2d-a124-f208155697c3 wrote 3040000 records across 7 files
2026-04-30 09:48:05,730 [pool-56-thread-1] INFO api.ExportJobManager: Created TAR archive: /tmp/recon/exports/export_missing_webui_ac16b513.tar
2026-04-30 09:48:05,755 [pool-56-thread-1] INFO api.ExportJobManager: Deleted temporary CSV files for job ac16b513-f3f0-4e2d-a124-f208155697c3
2026-04-30 09:48:05,755 [pool-56-thread-1] INFO api.ExportJobManager: Completed export job ac16b513-f3f0-4e2d-a124-f208155697c3 (3040000 records)
Demo video: CSV_Export_Feature.mp4

@devmadhuu devmadhuu self-requested a review April 30, 2026 10:17
devmadhuu (Contributor) commented:

@ArafatKhan2198 as discussed, please design the solution to be server-based for a single Recon user. We don't have user-based logins in Recon, and we should not localize the job-progress logic in the browser: browser windows opened on multiple machines viewing the Recon page should all see the same job and its progress. Only one job should be allowed to run at a time, and the remaining two should go into the queue.
